-
Notifications
You must be signed in to change notification settings - Fork 593
HDDS-9718. Add performance audit logging for S3G #5644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xichen01 for the patch.
I'm not sure audit log was intended for such purposes, but since every operation is (supposed to be) logged there, including per-operation performance metrics may make sense.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/ozone/audit/AuditMessage.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/BucketEndpoint.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/BucketEndpoint.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/ozone/audit/AuditMessage.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/s3/endpoint/ObjectEndpoint.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/s3gateway/src/main/java/org/apache/hadoop/ozone/audit/S3GAction.java
Outdated
Show resolved
Hide resolved
szetszwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xichen01 , thanks for working on this! It is inefficient to first construct a Map and then convert it to a String. We should build the String directly; see the comment inlined.
BTW, the change is quite long. See if you could split this into two tasks (optional).
| // Set to an Array so that S3GAction can be changed within the function. | ||
| final S3GAction[] s3GAction = new S3GAction[] {S3GAction.CREATE_KEY}; | ||
| boolean auditSuccess = true; | ||
| Map<String, String> perf = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a PerformanceStringBuilder to build the performance string instead of a Map.
//AuditLogger.java
public static class PerformanceStringBuilder {
private final StringBuilder builder = new StringBuilder(1024).append('[');
public void appendNanos(String name,long nanos){
append(name, TimeUnit.NANOSECONDS.toMillis(nanos));
}
public void append(String name,long value){
builder.append(name)
.append('=')
.append(value)
.append(", ");
}
public String build() {
final int length = builder.length();
if (length < 2) {
return "[]";
}
builder.setCharAt(length - 2, ']');
builder.setLength(length - 1);
return builder.toString();
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been implemented
|
Thanks @xichen01 for updating the patch. |
szetszwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the change looks good.
| @QueryParam("uploadId") @DefaultValue("") String uploadID, | ||
| InputStream body) throws IOException, OS3Exception { | ||
| long startNanos = Time.monotonicNowNanos(); | ||
| // Set to an Array so that S3GAction can be changed within the function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems out-dated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| //ObjectEndpoint | ||
| CREATE_MULTIPART_KEY, | ||
| CREATE_MULTIPART_KEY_BY_COPY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call it COPY_MULTIPART_KEY?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm referencing the AWS API name
CREATE_MULTIPART_KEY: AWS API name UploadPart
CREATE_MULTIPART_KEY_BY_COPY: AWS API name UploadPartCopy
I think it's a better match, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Let's keep it then.
|
@duongkame @kerneltime would you like to take a look? |
duongkame
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure audit log was intended for such purposes, but since every operation is (supposed to be) logged there, including per-operation performance metrics may make sense.
Indeed, not every operation is logged by default, but just the write ones.
What is the use of the metrics in individual log entries? I guess it may come from the fact that our metrics only support avg for now and outliners such as p90, p99 are missed. But isn't it fixed as per #5627.
| getMetrics().incListKeyCount(keyCount); | ||
| perf.appendCount(keyCount); | ||
| perf.appendOpLatencyNanos(opLatencyNs); | ||
| AUDIT.logReadSuccess(buildAuditMessageForSuccess(s3GAction, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AUDIT may not log reads because of performance, but the perf does always build(). Should the performance data be passed as a lamba instead, so that a new string is not actually built when read audi logs are not enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a problem that deserves attention
We currently generate AuditMessages via buildAuditMessageForSuccess, etc. In the AuditMessageForSuccess we perform the build of all relevant messages, regardless of whether or not the logs are logged.
This is probably not just a perf problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the problem also applies to the existing, "functional" part of the audit message.
Audit logging is enabled by default, and we need several changes to improve it, so I think we should do it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reopened HDDS-2638 for the improvement of audit message processsing/allocations.
by default, the audit log is enabled to log "read and write" operations
Part of the reason is yes, more in order to record the processing of S3G in more detail, when some users react to some key get/put/list timeout, slower, we can based on the key provided by the user, we know exactly how long each step takes, which is very helpful in locating the performance problem. The time taken to perform each step for a particular key is difficult to obtain from the Metrics. |
|
Thanks @xichen01 for the patch, @duongkame, @guohao-rosicky, @szetszwo for the review. |
What changes were proposed in this pull request?
Let the audit log record the detailed execution time
metadataLatencyMs/...createKeyLatencyMs/getKeyLatencyMs/getBucketKeyCount/...putLength/getLength/copyLength/...Example:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9718
How was this patch tested?
Manual test, Existing test